SUMMARY

In this paper we outline the framework of mathematical statistics with which one may 
study the properties of galaxy distance estimators. We describe, within this framework, 
how one may formulate the problem of distance estimation as a Bayesian inference prob- 
lem, and highlight the crucial question of how one incorporates prior information in this 
approach. We contrast the Bayesian approach with the classical 'frequentist' treatment of 
parameter estimation, and illustrate - with the simple example of estimating the distance 
to a single galaxy in a redshift survey - how one can obtain a significantly different result 
in the two cases. We also examine some examples of a Bayesian treatment of distance 
estimation - involving the definition of Malmquist corrections - which have been applied 
in recent literature, and discuss the validity of the assumptions on which such treatments 
have been based. 

1 INTRODUCTION 

Recently, the estimation of galaxy distances has assumed great importance in cosmology. 
The analysis of large-scale galaxy redshift surveys, used in conjunction with redshift- 
independent galaxy distance estimates, can place powerful constraints on the values of 
the cosmological parameters $H_0$ and $\Omega$ (c.f. Hendry, 1992b; Dekel, 1994), and in principle
can allow one to test several of the hypotheses - including the form of the initial
spectrum of density perturbations, the role of gravity in the growth of structure and the 
clustering properties of dark matter - on which current theories for the formation of large 
scale structure in the universe are largely based. Various methods have been developed to 
reconstruct the density and three-dimensional peculiar velocity field from galaxy redshift 
and redshift-distance surveys (c.f. Dekel et al, 1990, 1993; Simmons, Newsam & Hendry, 
1995; Rauzy, Lachieze-Rey & Henriksen, 1994), based upon the ansatz that the peculiar
velocity field is a potential field - an idea first developed in the POTENT reconstruction
method (Bertschinger & Dekel, 1989). At the same time, new statistical methods of 
analysing surveys which consist of redshifts alone have been developed (c.f. Lahav et al, 
1994; Fisher et al, 1994; Heavens & Taylor, 1995) based upon the description of the large 
scale density and velocity field in terms of sets of orthogonal functions. One of the biggest 
current challenges in this field is to combine in an optimal fashion the results of these 
two different methods of analysis, in order to place stronger constraints on cosmological 
models and the values of cosmological parameters - a subject which would merit an entire 
article in itself. In this article we will focus instead only on those issues which concern 
the former group of reconstruction methods - i.e. where one attempts to obtain redshift- 
independent distance estimates to galaxies. 

Attempts to map the large scale structure of the universe from redshift-independent 
galaxy distance estimates have not been without controversy. For many years considerable 
debate has been generated over the precise nature, or indeed the very existence, of galaxy 
concentrations such as the 'Great Attractor' in the direction of Hydra and Centaurus, for 
example (Lynden-Bell et al, 1988; Dressler & Faber, 1990; Mathewson, Ford & Buchhorn,
1992; Federspiel, Sandage & Tammann, 1994). A significant factor fuelling this contro- 
versy has been disagreement not so much over the astrophysical problems of determining 
'good' galaxy distance indicators (although this has undoubtedly played a part also) but 
rather disagreement over the equally fundamental question of what statistical methods 
one should adopt to analyse the galaxy data. In this paper we attempt to clarify and 
place in the open some of the different statistical approaches which have been adopted in 
this field of cosmological research, and to discuss - within the framework of mathematical 
statistics - the different underlying philosophies upon which (often implicitly) they are 
based. Our discussion should be viewed as a general introduction to the problem, suitable 
for a reader previously unfamiliar both with the relevant astronomical details of measur- 
ing galaxy distances and with the basic theory of probability and statistics upon which 
the topic is founded. References to more detailed articles, covering both the astronomical 
and statistical aspects of the problem, will be given wherever appropriate. 

The measurement of the distance of a galaxy is an example of an inference problem:
i.e. one cannot measure the distance directly but must infer it from the measurement 
of some other physical characteristic, such as the apparent visual magnitude or angular 
diameter. If one knew precisely the absolute magnitude or intrinsic diameter of the galaxy 
then one could immediately arrive at an exact determination of the galaxy distance. In 
early studies of the large scale distribution and motion of galaxies (c.f. Rubin et al, 1976; 
Sandage & Tammann, 1975a,b) the approach was simply to assume a priori some fiducial 
value for this absolute magnitude or diameter and thus infer galaxy distances on that ba- 
sis. In practice, however, not all galaxies have the same absolute magnitude or diameter 
and so the inference is statistical in nature. Shortly after these early studies significant 
progress was made with the identification of empirical relationships between absolute 
magnitude and diameter and other, distance-independent but directly measurable, phys- 
ical quantities such as velocity dispersion or colour (c.f. Faber & Jackson, 1976; Tully 
& Fisher, 1977; Visvanathan & Sandage, 1977). The Tully-Fisher relation, for example, 
essentially expresses a power law relationship between the luminosity and the rotation 
velocity - as measured from e.g. the 21cm neutral hydrogen radio emission - of spiral 
galaxies. Thus one measures the 21cm line width of neutral hydrogen for a given spiral
galaxy, applies the Tully-Fisher relation to infer the absolute magnitude of the galaxy,
and then infers the galaxy distance from its observed apparent magnitude. 

In the past decade the Tully-Fisher, and other similar, relations have been further 
refined and placed upon a firmer theoretical footing (Pierce & Tully, 1988; Salucci, Frenk
& Persic, 1993; Hendry et al, 1995) but they still contain a significant degree of intrinsic 
scatter and so do not provide an exact determination of absolute magnitude or diameter. 
Hence, the galaxy distance inferred from such a relation is still inherently statistical. In 
the language of mathematical statistics, the intrinsic scatter of the relation means that 
we can construct only an estimator of the galaxy distance, and that distance estimator 
will itself be subject to error. More formally, the distance estimator is a random variable 
with a definite distribution function, or equivalently probability density function (pdf), 
and a fortiori mean and variance. 

Unfortunately there is no unique way to construct distance estimators. One can make 
a choice of distance estimator which has certain desirable properties, the most obvious 
being that its distribution should have a small 'spread', or variance; on average over many 
realisations the estimator should give the true distance of the galaxy; and the estimator 
should use all of the information about the galaxy distance available in the data. These 
rather loosely defined properties have their corresponding rigorous definitions in the sta- 
tistical literature, and these are referred to as efficiency, unbiasedness and sufficiency 
respectively. 

One should remark that when measurement errors and intrinsic variability are small 
in the physical system which one is modelling, then the adoption of a broad class of dif- 
ferent statistical methods - or even different statistical philosophies - in testing models 
from observational data will usually make little difference to one's conclusions. Large 
discrepancies in the conclusions reached by various authors in the literature concerning 
the estimated distances of galaxies and clusters therefore arise primarily because of large 
intrinsic uncertainties inherent to the data. In other words, galaxy distance indicators 
are noisy, with typical distance errors from, e.g., the Tully-Fisher relation of around 20% 
or larger to individual galaxies. It is this fact which makes the question of how one ap- 
proaches the problem of choosing the 'best' galaxy distance estimator a non-trivial, and 
an extremely important, one. The typical size of distance errors has led many cosmolo- 
gists to attempt to incorporate prior information on the distribution of galaxy distances 
when defining distance estimators, with the aim of reducing the uncertainty in the final 
estimate. All examples of this approach can be traced back to what is termed in the sta- 
tistical literature as a Bayesian treatment of the problem of distance estimation, although 
references in the cosmology and astronomy literature have often not explicitly used the 
term 'Bayesian', nor indeed used wholly orthodox Bayesian methods, in their description 
of the problem. There are indeed some difficulties with this approach. On the one hand
there are philosophical and methodological problems that have long been recognised and 
debated by statisticians (c.f. Kendall & Stuart, 1963; Mood & Graybill, 1974; von Mises, 
1957; Feigelson & Babul, 1992) which go to the root definitions and concepts in the theory 
of probability. On the other, there is often no clear-cut way of deciding upon the nature 
of prior information one can justifiably use. This paper is not the appropriate place to 
discuss either of these questions in any great depth. We would like to emphasise here, 
however, the principle employed in Bayesian inference problems in the general statistics
literature: that results which depend heavily on the choice of prior information should be
treated with caution. 

Whilst the problem of galaxy distance estimation raises certain statistical issues which 
are somewhat unique to astronomy - in particular the important role of observational se- 
lection effects and the modelling of the physical processes underlying the various distance 
relations which are applied to galaxies - the fundamental concepts are precisely the same 
as one finds in the general statistical literature on inference problems and estimation. It 
seems sensible, therefore, for cosmologists to make full use of the 'machinery' - the defini- 
tions, notation and general results - developed by statisticians for tackling such problems. 
In this paper, as in our earlier papers on this subject, we shall attempt to adhere to this 
practice. 

The structure of this paper is as follows. In section 2 we discuss in more detail the 
nature of distance estimators, placing our discussion in the rigorous context of mathe- 
matical statistics and introducing the appropriate notation and conventions. We go on to 
discuss the role of prior information, to explain the concepts of a 'Bayesian' approach to 
estimation problems, and to examine the relationship between Bayesian and more ortho- 
dox or 'frequentist' approaches. We show, by means of the simple example of estimating 
the distances to galaxies in a single catalogue, how a Bayesian and frequentist approach 
will yield different results. In section 3 we discuss the various galaxy distance estimators 
which have been used in recent literature, drawing particular attention to the statistical 
'philosophy' (i.e. Bayesian or frequentist) upon which they are based, the validity of the
assumptions inherent in their definition, and the extent to which they can be regarded as
'good' estimators - in the sense of e.g. unbiasedness, efficiency and sufficiency, as introduced
above. Finally we discuss the practical outcomes of using these different estimators
for determining distances to individual galaxies and clusters and in the analysis of the 
peculiar velocity and density field by, e.g., the POTENT-based methods mentioned above.

2 STATISTICAL PROPERTIES OF DISTANCE ESTIMATORS 

One of the purposes of this section is to clarify our notation and statistical approach for 
the benefit of the reader previously unacquainted with the general statistics literature. In 
the interests of brevity we shall present here only the essential ideas and omit unnecessary 
detail, perhaps at the risk of appearing simplistic. A more thorough, and wholly rigorous, 
treatment of the mathematical foundations of parameter estimation can be found in a 
large number of textbooks on probability and statistics (c.f. Hoel, 1962; Kendall & 
Stuart, 1963; Mood & Graybill, 1974; Hogg & Craig, 1978).

What Is an Estimator? 

In rough terms, an estimator of some unknown parameter is a rule based on statistical 
data - i.e. a random sample drawn from some underlying population - for estimating 
the value of that parameter. If the parameter of interest is q then we shall write $\hat{\mathbf{q}}$ to
denote an estimator of q, following the standard statistical convention. Note that $\hat{\mathbf{q}}$ is
written in bold face to indicate the fact that it is a random or statistical variable (since
it is a function of data which are themselves statistical variables), again in keeping with
standard convention. An estimator will not, in general, yield the
true value of q, $q_0$ say, for every set of statistical data, but we would regard an estimator
as 'good' if it tends to yield the value $q_0$ 'on average', or 'in the long run' - rather vague
statements which can be quantified in terms of the bias and loss function associated with
the estimator chosen, as we discuss below.

By way of an illustrative example, a simple galaxy distance estimator could be constructed
only from the observed apparent magnitude of a galaxy (c.f. Hendry & Simmons,
1990; Hendry, 1992a). Thus we may write

$$m - M = 5 \log r + 25 \qquad (1)$$

where r is the true distance, measured in Mpc, and m and M denote the apparent and 
absolute magnitude of the galaxy respectively. Of course the actual distance of the galaxy 
can only be obtained if there is no error on the measured value of m and if M is known. 
We can estimate r, however, by making some assumption about the value of M (for 
simplicity we shall ignore any error on m in this discussion) and solving for r in equation 
(1). Suppose we take the value of M to be the mean value of absolute magnitude, $M_0$ say,
for the underlying population of all galaxies of a certain Hubble type. We thus obtain an
estimator of log distance, viz

$$\widehat{\log r} = 0.2 (m - M_0 - 25) \qquad (2)$$

Here the hat indicates an estimator. If we consider that the galaxy has been randomly
selected from an imaginary population of galaxies all at the same distance, but with
different absolute magnitudes, then $\widehat{\log r}$ must be considered to be a statistical variable,
as noted previously. The statistical properties of $\widehat{\log r}$ depend on the galaxy luminosity
function and on the selection function which determines whether a galaxy will or will not
be observed at true distance, r. It follows from equations (1) and (2) that we may write

$$\widehat{\log r} = \log r + 0.2 (M - M_0) \qquad (3)$$

For brevity we shall in future refer to $\log r$ as w, and $\widehat{\log r}$ as $\hat{\mathbf{w}}$.
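
As an illustrative aside, equations (1) to (3) can be checked numerically. The Python sketch below is not part of the original analysis: the function names and the numerical values (a population mean $M_0 = -20$, and a galaxy of absolute magnitude $-20.5$ at 50 Mpc) are assumptions chosen purely for illustration.

```python
import math

def apparent_magnitude(M, r_mpc):
    """Equation (1): apparent magnitude of a galaxy of absolute
    magnitude M at true distance r_mpc (in Mpc)."""
    return M + 5.0 * math.log10(r_mpc) + 25.0

def log_distance_estimate(m, M0):
    """Equation (2): estimated log10 distance, assuming the galaxy has
    the population mean absolute magnitude M0."""
    return 0.2 * (m - M0 - 25.0)

# Hypothetical values, for illustration only.
M0 = -20.0        # assumed population mean absolute magnitude
M_true = -20.5    # this galaxy is brighter than the population mean
r_true = 50.0     # true distance in Mpc

m = apparent_magnitude(M_true, r_true)
w_hat = log_distance_estimate(m, M0)
w_true = math.log10(r_true)

# Equation (3): the error in the estimate is 0.2 * (M - M0).
error = w_hat - w_true
print(f"w_true = {w_true:.4f}, w_hat = {w_hat:.4f}, error = {error:.4f}")
```

Because the galaxy in this example is brighter than the assumed mean, the estimator places it closer than its true distance, exactly as equation (3) requires.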

In general, the underlying pdf before selection for M is not known. This pdf is usually
assumed to be independent of position and is just the luminosity function (LF) of
M, written $\Psi(M)$. The distribution, $\Psi_{\rm obs}(M|r)$, of M for observable galaxies at actual
distance, r, will depend upon the selection function and indeed also on r (although for
simplicity we assume here no dependence on direction). Once this pdf, $\Psi_{\rm obs}(M|r)$, is given
the pdf of any function of the random variable, M, may be determined. In particular the
pdf of $\hat{\mathbf{w}}$ defined by equation (3) may easily be found. Note that while $\hat{\mathbf{w}}$ itself does not
depend on w, the pdf of $\hat{\mathbf{w}}$ does depend upon the true value of the parameter, as one
might expect.

Biased and Unbiased Estimators 

The mean, or expected, value of a random variable associated with a galaxy may be taken 
with respect to either the observable or the intrinsic galaxy distribution. We shall almost 
invariably consider the expectation with respect to the observable distribution in this 
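paper. Thus the estimator of log distance, $\hat{\mathbf{w}}$, is defined to be unbiased if

$$E[\hat{\mathbf{w}} \,|\, w] = w \qquad (4)$$

where the expectation value of any function, f(M), of M is defined as

$$E[f(M) \,|\, w] = \int f(M) \, \Psi_{\rm obs}(M|w) \, dM \qquad (5)$$

The bias, B(w), is defined as

$$B(w) = E[\hat{\mathbf{w}} \,|\, w] - w \qquad (6)$$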



When a galaxy survey is subject to a selection limit on apparent magnitude, the estimator
of log distance given by equation (2) is biased for all true log distances. Moreover,
simply replacing the mean absolute magnitude, $M_0$, of the underlying population by some
fiducially corrected value, $M_0 + c$, where c is a constant, cannot eradicate this bias (c.f.
Hendry & Simmons, 1990). One can apply an iterative procedure - effectively adding a
non-constant correction to $M_0$ - which considerably reduces the bias of $\hat{\mathbf{w}}$, although this
procedure does not converge to an unbiased estimator for all log distances (Hendry, 1992a).
It has been shown, however, (c.f. Schechter, 1980; Hendry & Simmons, 1994) that in the 
case of a relation of Tully-Fisher type - where one has an additional observable correlated 
with absolute magnitude - if the second observable is free from selection effects then one 
can define an estimator which is unbiased at all true log distances. We return to this 
issue in section 3. 
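
The effect of a magnitude limit on the estimator of equation (2) can be illustrated with a small Monte Carlo experiment. The sketch below is only a toy model under stated assumptions: a Gaussian luminosity function with mean $M_0 = -20$ and dispersion 1 mag, a sharp apparent magnitude limit of 14, and no error on m. None of these values is taken from the literature discussed here.

```python
import math
import random

def simulated_bias(r_mpc, M0=-20.0, sigma_M=1.0, m_lim=14.0,
                   n=50000, seed=1):
    """Monte Carlo estimate of the bias B(w) of the estimator of
    equation (2) for galaxies at true distance r_mpc, keeping only
    galaxies brighter than the apparent magnitude limit m_lim.
    The Gaussian luminosity function is an assumption of this toy model."""
    rng = random.Random(seed)
    w_true = math.log10(r_mpc)
    estimates = []
    for _ in range(n):
        M = rng.gauss(M0, sigma_M)           # draw an absolute magnitude
        m = M + 5.0 * w_true + 25.0          # equation (1)
        if m <= m_lim:                       # survey selection
            estimates.append(0.2 * (m - M0 - 25.0))  # equation (2)
    if not estimates:
        return None                          # nothing observable
    return sum(estimates) / len(estimates) - w_true

bias_near = simulated_bias(10.0)    # almost every galaxy is observable
bias_far = simulated_bias(100.0)    # only the bright tail is observable
print(f"bias at 10 Mpc:  {bias_near:+.3f}")
print(f"bias at 100 Mpc: {bias_far:+.3f}")
```

In this toy model the bias is negligible nearby, where the limit removes almost no galaxies, but becomes strongly negative at large distance: only intrinsically bright galaxies survive the cut, so assuming the mean $M_0$ for them underestimates their distance.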

Minimum Variance and Efficient Estimators 

There are obvious advantages in using unbiased estimators: in particular, for large samples
- e.g. when one is estimating the distance to a rich cluster of galaxies - the mean
estimated distance for the sample will also be unbiased, and of course will have decreasing 
variance as the sample size increases. Furthermore, if we are interested in, say, the 
distribution of actual distances of a catalogue of galaxies, the histogram of estimated 
distances can be readily deconvolved to yield an estimate of this underlying distribution 
of true distances. For biased estimators this would be more difficult (c.f. Eddington, 
1913; Newsam, Simmons & Hendry, 1994, 1995). Similarly, in model fitting problems
- the simplest of which in the present context is e.g. the determination of the Hubble
constant - we can expect parameter estimation to be much easier if we begin with unbiased 
estimators. Unbiasedness is not the only criterion for choosing an estimator, however. It 
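is also natural to desire the estimator to have a small variance. The variance, V(w), of
an estimator is defined as

$$V(w) = E\left[ \left( \hat{\mathbf{w}} - E[\hat{\mathbf{w}} \,|\, w] \right)^2 \,\Big|\, w \right] \qquad (7)$$
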
In practice one finds that there is a trade-off between small variance and small bias, in 
the sense that if you reduce one then you increase the other. The Cramer-Rao inequality 
places a lower bound on the variance for both biased and unbiased estimators (c.f. Hogg 
and Craig, 1978; Hendry, 1992a; Gould, 1995; Zaccheo et al, 1995), and an efficient esti- 
mator is one which attains that lower bound - i.e. which is a minimum variance estimator. 

In choosing an estimator it is also usually convenient to introduce a loss function, which 
essentially quantifies the 'loss', or cost, of making an incorrect estimate of a parameter. 
An obvious loss function to consider is

$$L(\hat{\mathbf{w}}, w) = (\hat{\mathbf{w}} - w)^2 \qquad (8)$$

A good estimator should yield low values of the expected loss for a large range of
values of the parameter w. This expected loss is called the risk, i.e. 

$$R(w) = E[L(\hat{\mathbf{w}}, w)] \qquad (9)$$

Note that for an unbiased estimator the risk and variance are identical, but for a 
biased estimator the risk is always strictly greater than the variance. Thus, if one has an 
estimator with small variance but large bias, this would still result in an estimator of large 
risk - indicating that risk is often the more meaningful quantity in comparing estimators. 
In general the bias, variance and risk of an estimator are related by the following simple 
expression 

$$R(w) = V(w) + [B(w)]^2 \qquad (10)$$
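
The decomposition of the risk into variance plus squared bias holds for the quadratic loss function and can be verified directly on simulated estimates. In the sketch below the 'estimator' is simply a set of Gaussian draws with a built-in offset; the true value, offset and scatter are hypothetical numbers chosen only to make the check concrete.

```python
import random

def bias_variance_risk(estimates, w_true):
    """Sample bias, variance and quadratic-loss risk of a set of
    estimates of the true value w_true."""
    n = len(estimates)
    mean_est = sum(estimates) / n
    bias = mean_est - w_true
    variance = sum((e - mean_est) ** 2 for e in estimates) / n
    risk = sum((e - w_true) ** 2 for e in estimates) / n
    return bias, variance, risk

rng = random.Random(0)
w_true = 2.0
# A hypothetical biased estimator: offset of -0.1, scatter of 0.2.
estimates = [rng.gauss(w_true - 0.1, 0.2) for _ in range(10000)]

bias, variance, risk = bias_variance_risk(estimates, w_true)
print(f"bias = {bias:+.4f}, variance = {variance:.4f}, risk = {risk:.4f}")
print(f"variance + bias^2 = {variance + bias ** 2:.4f}")
```

With the variance defined about the sample mean, the identity holds exactly for the sample moments (up to rounding), so the two printed risk values agree.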

Sufficiency 

In estimating the distance of a galaxy one does not generally adopt an estimator of the 
simple form of equation (2), which is a function only of apparent magnitude, but rather 
makes use of a distance indicator such as the Tully-Fisher relation which depends upon the
strong correlation between absolute magnitude and some other distance-independent,
directly measurable, observable. Since the underlying physical relationship in an indicator 
of this type is unlikely to depend upon only two variables, one could in principle construct 
a distance estimator as a function of an arbitrary number of observables, or statistics. 
The bias and risk of such an estimator would depend, of course, upon how well correlated 
were the observables. In Hendry & Simmons (1994) the general case of estimators formed 
from three correlated observables is formulated, and in Kanbur & Hendry (1995) a specific 
example is considered where the addition of a fourth observable - the maximum apparent 
magnitude - to the period, mean luminosity, colour relation for Cepheid variable stars 
does indeed result in a distance estimator of significantly smaller variance and risk. 

An obvious general question to ask, then, is whether there exists a function, say
$\hat{\mathbf{w}}(x_1, \ldots, x_n)$, of a set of observables, $x_1, \ldots, x_n$, which 'contains' all of the information
about the true value of w. Such a function is known as a sufficient statistic - and hence
would define a sufficient estimator - for w, and so should be preferred over another estimator
without this property. The property of sufficiency can be given a more rigorous
mathematical definition in terms of the joint pdf of $x_1, \ldots, x_n$ and w (c.f. Mood & Graybill,
1974). Suppose that $\hat{\mathbf{w}}^*(x_1, \ldots, x_n)$ is another statistic based on the observables, $x_1, \ldots, x_n$,
which is not a function of $\hat{\mathbf{w}}$. Then $\hat{\mathbf{w}}$ is defined to be sufficient if, for any such $\hat{\mathbf{w}}^*$, the
conditional distribution of $\hat{\mathbf{w}}^*$ given $\hat{\mathbf{w}}$ does not depend on the true parameter value, w.

This definition essentially states that once the value of the sufficient statistic has been 
specified, one cannot find any other statistic based on the same set of observables which 
gives any further information about the true value of w. In a sense, $\hat{\mathbf{w}}$ 'exhausts' all the
information about w that is contained in the observed values of $x_1, \ldots, x_n$.

Bayes' Estimators 

So far we have said nothing about the incorporation of prior information in the estimation 
of galaxy distances. Bayesian approaches attempt to do precisely this. 



A fully fledged Bayesian approach would regard w - in the above notation - not as 
a parameter, but as a statistical variable. The probability (more commonly referred to 
as the likelihood) of this variable taking any given value would be determined by what is 
known as its prior distribution: prior, that is, to the data that we presently have at hand. 
In the cosmological setting, therefore, w - the log distance of a galaxy - would be taken 
to have a prior distribution before the apparent magnitude or diameter or line width of 
this galaxy were measured. This prior distribution would be based on previous informa- 
tion about the distribution of galaxies as a whole - or even preconceptions about this 
distribution, such as the assumption that the spatial distribution of galaxies be uniform. 
In this case one has to modify the orthodox frequentist view of probability as a 'limit' of 
relative frequencies and adopt instead a view of probability as a measure of one's state of 
knowledge about a random variable. 

The posterior distribution for w, once the data for a particular galaxy has been taken 
into account, is then obtained by applying Bayes' theorem. Suppose one's distance esti- 
mator is a function of two variables, m and P - denoting for example apparent magnitude 
and log rotation velocity for the Tully-Fisher relation. Bayes' theorem states that

p(m, P|w)p(w) = p(w|m, P)p(m, P) (11) 

Taking p(m, P) to be a constant, one obtains the posterior distribution for w, viz

p(w|m, P) = Cp(m, P|w)p(w) (12) 

where p(w) is the prior, C is a normalisation constant and p(m, P|w) is the conditional 
probability of m, P given w. 

This approach in itself does not give an estimator of w, which is a statistical variable 
and not strictly speaking a parameter in the Bayesian context, but rather it gives a 
posterior pdf for w from which one may define a Bayes' estimator (c.f. Mood & Graybill, 
1974) in the following way. A Bayes' estimator, $\hat{\mathbf{w}}_{\rm Bayes}$, minimises the risk, R(w), averaged
over the prior distribution, p(w), for w. Thus for a Bayes' estimator the integral

$$\int R(w) \, p(w) \, dw \qquad (13)$$

is a minimum. It can be shown that a Bayes' estimator in fact minimises the loss function 
averaged over the distribution for w conditional on the observed data. Explicitly it 
minimises 

$$\int L(\hat{\mathbf{w}}_{\rm Bayes}, w) \, p(w \,|\, {\rm data}) \, dw \qquad (14)$$

from which $\hat{\mathbf{w}}_{\rm Bayes}$ can be found.

It is instructive to consider a simple example where we are estimating the log distance,
w, of a galaxy. Let us assume that we have already an unbiased (in the sense of equation
6) 'raw' estimator, $\hat{\mathbf{w}}$, based on some distance indicator, which we shall for expediency
take to be normally distributed about the true log distance w with variance $\sigma^2$. Thus the
conditional distribution for $\hat{\mathbf{w}}$ given w is

$$p(\hat{\mathbf{w}} \,|\, w) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[ -\frac{1}{2} \left( \frac{\hat{\mathbf{w}} - w}{\sigma} \right)^2 \right] \qquad (15)$$



Let us assume, however, that the galaxy is randomly selected from some underlying
population with true log distance w distributed normally about some mean value, $w_c$,
and variance $\sigma_c^2$ - where the subscript c refers to the catalogue from which the galaxy is
drawn. This normal distribution is taken to be the prior, so in the above notation

$$p(w) = \frac{1}{\sqrt{2\pi}\,\sigma_c} \exp\left[ -\frac{1}{2} \left( \frac{w - w_c}{\sigma_c} \right)^2 \right] \qquad (16)$$

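It is now straightforward to show that the conditional distribution for w given the
value of $\hat{\mathbf{w}}$ is normally distributed with mean, $w_B$, and variance, $\sigma_B^2$, given by

$$w_B = \frac{\hat{\mathbf{w}} + \beta w_c}{1 + \beta} \qquad (17)$$

and

$$\sigma_B^2 = \frac{\sigma^2}{1 + \beta} \qquad (18)$$

where $\beta = \sigma^2 / \sigma_c^2$, from which it follows that

$$\hat{\mathbf{w}}_{\rm Bayes} = \frac{\hat{\mathbf{w}} + \beta w_c}{1 + \beta} \qquad (19)$$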

The interpretation of this result is very straightforward. If the variance of the indicator
is much smaller than the population variance of the normal distribution of true
log distance for observable galaxies then $\beta \simeq 0$ and one obtains essentially the 'raw' log
distance estimator, $\hat{\mathbf{w}}$, suggested by the indicator. If, on the other hand, the indicator
provides very poor information about the distance of the galaxy then $\beta$ is very large, and
the Bayes' estimator yields approximately $w_c$, the mean true log distance of the observable
galaxies in the catalogue. This simple example demonstrates that, provided the scatter
in one's distance indicator is sufficiently small, one obtains essentially the same estimator 
irrespective of whether one adopts a Bayesian approach or not - and the estimator is thus 
largely insensitive to the prior information. The role of the prior becomes increasingly 
important, however - and the difference between a Bayesian and frequentist approach 
becomes more apparent - as the scatter in the distance indicator increases. 
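
The limiting behaviour described above is easy to see numerically. The sketch below takes equation (19) to read $\hat{\mathbf{w}}_{\rm Bayes} = (\hat{\mathbf{w}} + \beta w_c)/(1 + \beta)$ with $\beta = \sigma^2/\sigma_c^2$, which is the standard result for a normal likelihood combined with a normal prior. The function name and all numerical values are hypothetical and serve only as an illustration.

```python
def bayes_log_distance(w_hat, sigma, w_c, sigma_c):
    """Posterior mean and variance of the log distance w for a normal
    'raw' estimator (scatter sigma) and a normal prior (mean w_c,
    dispersion sigma_c)."""
    beta = sigma ** 2 / sigma_c ** 2
    w_bayes = (w_hat + beta * w_c) / (1.0 + beta)
    var_post = sigma ** 2 / (1.0 + beta)
    return w_bayes, var_post

w_hat, w_c, sigma_c = 2.0, 1.5, 0.3   # hypothetical raw estimate and prior

# Accurate indicator (beta << 1): the raw estimate is barely changed.
precise, _ = bayes_log_distance(w_hat, 0.01, w_c, sigma_c)

# Very noisy indicator (beta >> 1): the estimate collapses onto w_c.
noisy, _ = bayes_log_distance(w_hat, 3.0, w_c, sigma_c)

# Equal variances (beta = 1): midpoint of the raw estimate and w_c.
equal, var_equal = bayes_log_distance(w_hat, 0.3, w_c, sigma_c)

print(f"precise: {precise:.4f}, noisy: {noisy:.4f}, equal: {equal:.4f}")
```

The three cases show how strongly the prior drives the result once the indicator scatter becomes comparable to, or larger than, the spread of true log distances in the catalogue.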

One can regard equation (19) as defining a correction to $\hat{\mathbf{w}}$ based on the prior information
- in this case that the underlying population of true log distance is normally
distributed. Corrections of this type have come to be known in the cosmology literature
as Malmquist corrections, and in the context of mapping large scale structure they were 
initially applied assuming the distribution of galaxies to be spatially homogeneous (c.f. 
Lynden-Bell et al, 1988; Dekel et al, 1993) - just as the distribution of stars had been as- 
sumed homogeneous in the original analytical treatments of Malmquist (1920, 1922) and 
Eddington (1913). Recently, however, attempts have been made to apply more general, 
inhomogeneous, Malmquist corrections which address the fact that the galaxy distribu- 
tion displays small-scale clustering (c.f. Landy & Szalay, 1992; Hudson, 1994; Dekel, 1994; 
Newsam, Simmons & Hendry, 1995; Hudson et al, 1995; Freudling et al, 1995). We briefly 
consider some important technical problems regarding the application of inhomogeneous 
Malmquist corrections in section 3. It is worth noting here, however, that an entirely 
frequentist approach to distance estimation has the advantage that the definition of an 
unbiased estimator is completely independent of the underlying galaxy true number den- 
sity - and hence is unaffected by arguments about the form of prior distribution which 



3 GALAXY DISTANCE INDICATORS IN RECENT LITERATURE 



Most redshift-independent methods of estimating galaxy distances which have featured in 
the recent cosmological literature are based upon secondary distance indicators - which 
require to be calibrated using a sample of galaxies in, e.g., a nearby cluster, the distance 
of which is already known. Notable exceptions to this have been the recent extension 
to beyond the Local Group of the extragalactic distance scale measured from Cepheid 
variables and the application of the expanding photosphere method (EPM) to determine 
the distances of type II supernovae (SN). Both Cepheids and type II SN are examples of 
primary distance indicators which can be calibrated either locally - within our own galaxy
- or from theoretical considerations. For a discussion of the physical basis for these indicators
the reader is referred to, e.g., Kirschner & Kwan (1974), Eastman & Kirschner
(1989), Jacoby et al (1992) and references therein. Both indicators have a small intrinsic
dispersion (~10-15% to individual objects) and are thus considerably less susceptible to
the problems which arise in the definition of Malmquist corrections and sensitivity to the
choice of prior information (essentially because $\beta$, the ratio of the estimator variance
to the variance of the underlying population, is small). This property of course makes
both Cepheids and type II SN well suited to the estimation of the Hubble constant - either
directly or in combination with other secondary indicators such as type Ia SN (c.f. Saha
et al, 1994; 1995). Indeed, the high estimates of $H_0$ reported in Freedman et al (1994)
and Pierce et al (1994), based on the distance of Cepheids in Virgo cluster galaxies, and
those of Schmidt et al (1992, 1994) based on the EPM distances of type II SN beyond
the Local Supercluster (and thus less adversely affected by peculiar velocities), provide a
compelling argument in favour of a value of $H_0 > 60$ km s$^{-1}$ Mpc$^{-1}$ - despite the difficulties
of reconciling these results with astrophysical estimates of the age of the galactic disc (c.f. 
van den Bergh 1995; Chamcham & Hendry, 1995). 

Of the secondary distance indicators currently in widespread use, only two are thought 
to be sufficiently accurate to make the question of how to best use prior information essentially
unimportant: these are surface brightness fluctuations (SBF) and the luminosity
- light curve shape relation for type Ia SN. The former distance indicator, SBF, was pioneered
by Tonry & Schneider (1988) and is based upon the fact that the fluctuations -
due to the discreteness of individual stars - in surface brightness across the CCD image of
a nearby elliptical galaxy will be larger than those for a more distant galaxy. The physical
basis of SBF and the details of its calibration are described in Jacoby et al (1992).
Relative distances of a typical accuracy of 5% have been derived to a sample of several
hundred ellipticals out to a redshift of around 6000 km s$^{-1}$ using this indicator (Dressler,
1994).

Type Ia SN have long been recognised as useful 'standard candles' since they are
observable to very large distances and have a luminosity function which is well described by a
Gaussian distribution of dispersion around 0.5 mag. (Sandage & Tammann, 1993; Hamuy
et al, 1995). In Vaughan et al (1995), it is argued that the pre-selection of SN based on
a colour criterion reduces this dispersion to ~ 0.3 mag., which - although a significant
improvement - still represents a typical percentage distance error of around 15% to an
individual galaxy. In Riess, Press & Kirshner (1995a,b) however, the shape of the SN
light curve is used to more tightly constrain the peak luminosity and leads to a typical
relative distance error of only 5% - small enough to render Malmquist corrections largely
unimportant. This method has been used both to estimate the Hubble constant and to
determine the bulk flow motion of the Local Group on a scale of ~ 7000 km s$^{-1}$, yielding a
motion which is consistent with the COBE measurement of the dipole anisotropy in the
microwave background radiation, but inconsistent with the dipole motion reported by Lauer
& Postman (1994), based on the redshifts of Abell clusters at distances of 8000 - 11000
km s$^{-1}$.
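The correspondence between a magnitude dispersion and a percentage distance error quoted above can be checked with a short calculation. The sketch below assumes the usual distance modulus relation, so that distance scales as 10^(0.2 x magnitude), and keeps only the first-order term; the function name is ours and purely illustrative.

```python
import math

def fractional_distance_error(sigma_mag):
    """First-order fractional distance error implied by a magnitude scatter.

    Assumes distance scales as 10**(0.2 * magnitude), so that
    delta_r / r ~ 0.2 * ln(10) * sigma_mag for small sigma_mag.
    """
    return 0.2 * math.log(10.0) * sigma_mag

for sigma in (0.5, 0.3):
    percent = 100.0 * fractional_distance_error(sigma)
    print(f"{sigma:.1f} mag -> roughly {percent:.0f}% distance error")
```

Under this approximation a dispersion of 0.3 mag corresponds to a distance error of a little under 15%, in line with the figure quoted in the text.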

The vast majority of recent analyses of the peculiar velocity and density fields, and 
the estimation of the density parameter $\Omega$ using redshift-independent distance indicators,
have been carried out primarily with the Tully-Fisher (TF) and $D_n - \sigma$ distance indicator
relations for spirals and ellipticals respectively. As we remarked above, the TF relation
essentially expresses a power law relationship between the luminosity and rotation velocity
for spiral galaxies; the $D_n - \sigma$ relation similarly expresses a power law relationship
between the central velocity dispersion and isophotal diameter of early-type galaxies (c.f.
Jacoby et al, 1992). Although the number of galaxy distances estimated by these two
relations currently stands at over 4000 (around a factor of ten larger than the number of
distance estimates from SBF and SN distance indicators), and continues to grow rapidly
each year, both the TF and $D_n - \sigma$ relations are considerably more noisy - with dispersions
of around 20% to individual galaxies. It is for this reason that the issue of how
- or indeed if - one should make use of prior information in the definition of 'optimal' 
estimators continues to be regarded as of crucial importance when interpreting the results 
of applying these distance indicators to analyse redshift surveys. 

Both the TF and $D_n - \sigma$ relations are usually calibrated by performing a linear regression
on a calibrating sample of galaxies whose distances are otherwise known. It
is instructive to consider this calibration procedure in more detail, in order to illustrate 
some of the statistical pitfalls which may arise, for the generic example of the TF relation. 
As before, we denote the log rotation velocity by P and let $\hat{M}$ denote the estimator of
absolute magnitude which one derives from the TF relation, from which one may derive
the corresponding 'raw' estimator of log distance, $\hat{w}$, from equation (3) in the obvious
way. Thus, we obtain from the calibration a linear relationship between $\hat{M}$ and P,

$$\hat{M} = \alpha P + \beta \qquad (20)$$

where $\alpha$ and $\beta$ are constants. The choice of which linear regression is most appropriate
is non-trivial when one's survey is subject to observational selection effects. We can 
demonstrate this with the following simple example. Suppose that the intrinsic joint 
distribution of absolute magnitude and log(rotation velocity) is a bivariate normal. Figure 
1 shows schematically the scatter in the TF relation in this case, for a calibrating sample 
which is free from selection effects - e.g. a nearby cluster. (More precisely, the ellipse 
shown is an isoprobability contour enclosing a given confidence region for M and P). 
The solid and dotted lines show the linear relationships obtained by regressing log rotation
velocity on magnitude and magnitude on log rotation velocity respectively. Thus the dotted
line is defined as the expected value of M at given P, while the solid line is defined as the 
expected value of P at given M. Since in practice one wishes to infer the value of M from 
the measured value of P, the M on P regression has been referred to in the literature as 
defining the 'direct' or 'forward' TF relation, while using the P on M regression defines 
the 'inverse' TF relation. For the bivariate normal case the equations of the direct and
inverse regression lines are as follows:

$$E(M|P) = M_0 + \rho \frac{\sigma_M}{\sigma_P} (P - P_0) \qquad (21)$$

$$E(P|M) = P_0 + \rho \frac{\sigma_P}{\sigma_M} (M - M_0) \qquad (22)$$

where $M_0$, $P_0$, $\sigma_M$, $\sigma_P$ and $\rho$ denote the means, dispersions and correlation coefficient of
the bivariate normal distribution of M and P. Both regression lines can be written in
the form of equation (20), thus defining $\hat{M}$ as a function of P, although of course the
constants $\alpha$ and $\beta$ are different in each case. Moreover the definition of $\hat{M}$ is subtly
different in each case. For the direct regression $\hat{M}$ is identified as the mean absolute
magnitude at the observed log line width. For the inverse regression on the other hand
$\hat{M}$ is defined such that the observed log line width is equal to its expected value when
$M = \hat{M}$. Consequently, as is apparent from their slopes, the direct and inverse regression
lines give rise to markedly different distance estimators, although it is straightforward to
show that in the absence of selection effects both estimators are unbiased, in the sense
defined in equation (4), above.

Figure 1: Schematic 'Direct' and 'Inverse' Tully-Fisher relations for the case of a nearby,
completely sampled, cluster.
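The difference between the two regression lines in the selection-free case can be illustrated with a small simulation. The following is a minimal sketch only: the bivariate normal parameters are hypothetical values chosen for illustration and are not taken from any calibrating sample, and ordinary least squares via `numpy.polyfit` stands in for whatever fitting procedure one would use in practice.

```python
import numpy as np

# Hypothetical bivariate normal parameters (illustrative assumptions only).
M0, P0 = -20.0, 2.3                      # mean absolute magnitude, mean log rotation velocity
sigma_M, sigma_P, rho = 1.0, 0.15, -0.9  # dispersions and correlation coefficient

rng = np.random.default_rng(0)
cov = [[sigma_M**2, rho * sigma_M * sigma_P],
       [rho * sigma_M * sigma_P, sigma_P**2]]
M, P = rng.multivariate_normal([M0, P0], cov, size=200_000).T

# 'Direct' regression: M on P. 'Inverse' regression: P on M.
direct_slope = np.polyfit(P, M, 1)[0]    # estimates rho * sigma_M / sigma_P
inverse_slope = np.polyfit(M, P, 1)[0]   # estimates rho * sigma_P / sigma_M

print("direct  dM/dP:", direct_slope, "expected", rho * sigma_M / sigma_P)
print("inverse dP/dM:", inverse_slope, "expected", rho * sigma_P / sigma_M)
# Rewritten as M against P, the inverse line has slope 1 / inverse_slope,
# which is steeper than the direct slope whenever |rho| < 1.
print("inverse line as dM/dP:", 1.0 / inverse_slope)
```

The product of the two fitted slopes approaches the square of the correlation coefficient, so the two lines coincide only in the limit of a perfectly tight relation.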

The situation is very different when we include the effects of observational selection, 
however. This is illustrated in Figure 2, which shows the scatter in the TF relation in 
a calibrating sample subject to a sharp cut-off in absolute magnitude - as would be the 
case in e.g. a distant cluster observed in an apparent magnitude-limited survey. We can 
see that in this case the slope of the direct regression of M on P is substantially changed 
from that in the nearby cluster - indeed the direct regression is no longer linear at all. 
This means that if one calibrates the TF relation in the nearby cluster using the direct 
regression and then applies this relation to the more distant cluster, one will systemat- 
ically underestimate its distance, since the expected value of M given P in the distant 
cluster is systematically brighter than that in the nearby cluster as fainter galaxies pro- 
gressively 'fade out' due to the magnitude limit. The corresponding 'direct', or 'M on P', 
log distance estimator will therefore be negatively biased. 
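The effect of a sharp magnitude cut on the direct regression can be made explicit using the mean of a truncated normal distribution. The sketch below assumes the same bivariate normal model, with hypothetical parameter values of our own choosing, and a selection which retains only galaxies brighter than a limiting absolute magnitude `M_lim`; the function names are illustrative.

```python
import math

# Hypothetical bivariate normal TF parameters (illustrative assumptions only).
M0, P0 = -20.0, 2.3
SIGMA_M, SIGMA_P, RHO = 1.0, 0.15, -0.9

def normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def mean_M_given_P(P, M_lim=None):
    """Expected M at given P, optionally keeping only galaxies with M < M_lim."""
    mu = M0 + RHO * SIGMA_M / SIGMA_P * (P - P0)   # selection-free direct line
    if M_lim is None:
        return mu
    s = SIGMA_M * math.sqrt(1.0 - RHO**2)          # scatter of M about the line
    a = (M_lim - mu) / s
    # Mean of a normal distribution truncated from above at M_lim.
    return mu - s * normal_pdf(a) / normal_cdf(a)

for P in (2.0, 2.15, 2.3, 2.45):
    shift = mean_M_given_P(P, M_lim=-20.0) - mean_M_given_P(P)
    print(f"P = {P:.2f}: shift in E(M|P) due to selection = {shift:+.3f} mag")
```

The shift is always towards brighter magnitudes and grows towards small rotation velocities, so the selected relation is no longer a straight line and a calibration taken from the selection-free sample would lead one to underestimate distances.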



Figure 2: Schematic 'Direct' and 'Inverse' Tully-Fisher relations for the case of a distant 
cluster subject to a sharp selection limit on absolute magnitude. 

In an important paper Schechter (1980) observed that the slope of the inverse regres- 
sion line is unchanged, irrespective of the completeness of one's sample, provided that the 
selection effects are in magnitude only. We can see that this observation is valid in the 
simple case considered in Figure 2. In other words the expected value of P given M is un- 
affected by the selection effects and, therefore, defines an unbiased log distance estimator. 
In Hendry & Simmons (1994), Schechter's result is derived within the rigorous framework 
of mathematical statistics, and the assumptions upon which it is based are generalised. In 
particular it is shown that the inverse TF log distance estimator is Gaussian and unbiased
at all true log distances provided only that the conditional distribution of P given M is
Gaussian, that E(P|M) is a linear function of M, and that the sample is not subject to
selection on rotation velocity. Moreover, since the inverse log distance estimator is Gaussian
it will also automatically be a sufficient and efficient estimator, as defined in section
2. In Hendry (1992a) it was also shown that when there is no selection on rotation velocity 
then the inverse log distance estimator is the only unbiased estimator which is a linear 
function of log rotation velocity and apparent magnitude. In particular, the 'orthogonal' 
(c.f. Giraud 1987), 'bisector' (c.f. Pierce & Tully 1988) and 'mean' (c.f. Mould et al 
1993) regression lines also give rise to estimators which are biased at all true log distances 
in this case. A similar conclusion was also reached in Triay, Lachieze-Rey & Rauzy (1994). 
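Schechter's observation lends itself to a simple numerical check. The sketch below is a toy Monte Carlo under assumptions of our own: a bivariate normal distribution with hypothetical parameters, a sharp cut in absolute magnitude only, and ordinary least squares fits. It is not the derivation given in Hendry & Simmons (1994).

```python
import numpy as np

# Hypothetical bivariate normal parameters (illustrative assumptions only).
M0, P0 = -20.0, 2.3
sigma_M, sigma_P, rho = 1.0, 0.15, -0.9
M_lim = -20.0                            # keep only galaxies brighter than this

rng = np.random.default_rng(1)
cov = [[sigma_M**2, rho * sigma_M * sigma_P],
       [rho * sigma_M * sigma_P, sigma_P**2]]
M, P = rng.multivariate_normal([M0, P0], cov, size=200_000).T
keep = M < M_lim                         # selection in magnitude only

# Inverse regression (P on M), before and after selection.
inverse_all = np.polyfit(M, P, 1)[0]
inverse_cut = np.polyfit(M[keep], P[keep], 1)[0]

# Direct regression (M on P), before and after selection.
direct_all = np.polyfit(P, M, 1)[0]
direct_cut = np.polyfit(P[keep], M[keep], 1)[0]

print("inverse slope:", inverse_all, "->", inverse_cut)
print("direct slope: ", direct_all, "->", direct_cut)
```

In this toy case the inverse slope is essentially unchanged by the cut, while the direct slope becomes markedly shallower. Adding a cut on P as well would be expected to spoil the first result, in keeping with the condition stated in the text.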

The unbiased properties of the inverse TF relation have led to its use in defining a 'raw' 
distance estimator in a number of different recent analyses of the peculiar velocity field, 
including Newsam, Simmons & Hendry (1995), Freudling et al (1995), Nusser & Dekel 
(1994), Shaya, Tully & Pierce (1992) and Shaya, Tully & Peebles (1995). Its acceptance 
has been far from universal, however. Part of the reason for this is that, of course, in 
practice it is not the case that galaxy samples are free from selection effects on rotation 
velocity. In fact, it is commonly the case that redshift surveys are first selected on the 
basis of either apparent diameter or B-band apparent magnitude, or both, while the TF 
photometry is then carried out in the near infra-red, or I-band. This leads to a con- 
siderably more complex selection function, as modelled in Sodre & Lahav (1993), which 
in general renders all linear regressions biased. Essentially this problem arises because
diameter, I-band and B-band magnitude and rotation velocity are mutually correlated
variables, so that the selection on B-band magnitude and angular diameter 'pollutes' the 
joint distribution of I-band magnitude and rotation velocity in the TF relation - thus 
effectively rendering the assumptions inherent in deriving the unbiasedness of the inverse 
TF relation no longer valid (c.f. Hendry & Simmons 1994; Willick 1994). 

One can determine the correct slope and zero point of the 'direct' TF relation from 
a cluster subject to observational selection effects by the application of a straightforward
iterative procedure - thus solving what has been termed in the literature the 'calibration
problem' (c.f. Willick 1994; Hendry et al, 1995). It is important to recognise,
however, that the corresponding 'raw' log distance estimator will still be biased, in the 
sense of equation (4), at all true distances if applied to a galaxy survey subject to magni- 
tude selection effects. This is because the joint distribution of absolute magnitude and log 
rotation velocity for observable galaxies will not be equal to the intrinsic joint distribution. 

Why has the use of the 'direct' TF relation in recent literature continued to be 
widespread? To understand the reason for this we must first note that most recent anal- 
yses of galaxy distances and peculiar velocities have been carried out within a Bayesian 
framework, thus involving the application of Malmquist corrections to the 'raw' log dis- 
tance estimator. The motivation for adopting a Bayesian approach (even if the Bayesian 
nature of the problem has not always been explicitly acknowledged by authors!) comes 
about from the way in which galaxy distance estimates and redshifts have been combined 
in the majority of analyses. In both the early 'toy' parametric velocity field models of 
e.g. Lynden-Bell et al (1988), Dressler & Faber (1990), and the more sophisticated reconstruction
methods such as POTENT (c.f. Dekel et al 1990), essentially galaxies are
binned and grouped together and assigned radial peculiar velocity estimates on the basis 
of their estimated distance. The galaxy's actual distance could be radically different, and 
will depend on the true spatial distribution of galaxies and the exact nature of the survey 
selection function. Clearly galaxies which have small estimated distance are more likely 
to have been scattered down from larger true distances, since a volume element of fixed 
solid angle increases in size with true distance; close to the limit of the survey volume, 
however, this might no longer be the case, since galaxies scattered from larger true distances
might be too faint to be included in the redshift survey. By requiring that on average the
actual radial coordinate of the galaxy be equal to its estimated distance, one would also 
ensure that on average the correct peculiar velocity would be ascribed to that galaxy's 
apparent position. The estimator which satisfies this condition can be defined following 
the Bayesian approach outlined for the simple illustrative example of section 2, and it 
is straightforward to show that such a 'Malmquist corrected' distance estimator, ?b ay cs, 
satisfies the equation 



where C is a normalisation constant. 

The key point about equation (23) is that - as before - the Bayesian distance esti- 
mator depends upon the prior distribution of true log distance, p(w). There has been 
no consensus in the literature on which prior one should adopt. As we mentioned in sec- 
tion 2, in Lynden-Bell et al (1988) the prior is assumed to correspond to a homogeneous 
distribution of galaxies - thus defining homogeneous Malmquist corrections which are a 
function only of distance. In Landy & Szalay (1992), on the other hand, a more general
correction is derived by first estimating p(w) from a spline fit to the histogram of log 
distance estimates for the galaxies in the survey, thus in principle taking into account 
inhomogeneities in the galaxy distribution. Due to the sparseness of surveys, however, it 
is usually necessary to average the distribution of galaxies over large solid angles, if not 
all, of the sky. Therefore, the effects of clustering may still go largely unaccounted for 
in the Landy & Szalay prescription (c.f. Newsam et al, 1994). In other recent analyses 
(c.f. Hudson 1994; Hudson et al 1995; Dekel 1994; Willick 1994; Freudling et al 1995) a 
different method is proposed for obtaining the prior distribution - by reconstructing the 
density field of optical or IRAS-selected galaxies based on redshifts alone, assuming linear 
or mildly non-linear theory to adequately describe the gravitational collapse of structure
- smoothed on a scale of the order of 10 Mpc.
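The sensitivity of a Bayesian estimate to the adopted prior can be illustrated numerically. The sketch below rests on several assumptions of our own and is not a reproduction of any of the prescriptions cited above: log distance is taken to be a natural logarithm, the raw estimator is taken to be normally distributed about the true log distance with dispersion `sigma`, the corrected estimate is taken to be the posterior mean of true log distance, and a homogeneous distribution is represented by a prior proportional to exp(3w). The function name is hypothetical.

```python
import numpy as np

def posterior_mean_log_distance(w_hat, sigma, log_prior):
    """Posterior mean of true log distance w given a raw estimate w_hat.

    Assumes a Gaussian likelihood p(w_hat | w) of dispersion sigma and a
    prior supplied through its logarithm, log_prior(w), up to a constant.
    """
    w = np.linspace(w_hat - 10.0 * sigma, w_hat + 10.0 * sigma, 20001)
    log_post = -0.5 * ((w_hat - w) / sigma) ** 2 + log_prior(w)
    weights = np.exp(log_post - log_post.max())   # unnormalised posterior on the grid
    return np.sum(w * weights) / np.sum(weights)

sigma = 0.2                  # roughly a 20% distance error
w_hat = np.log(50.0)         # raw estimate of ln(distance), arbitrary units

homogeneous = posterior_mean_log_distance(w_hat, sigma, lambda w: 3.0 * w)
flat = posterior_mean_log_distance(w_hat, sigma, lambda w: np.zeros_like(w))

print("shift for homogeneous prior:", homogeneous - w_hat)
print("analytic value 3*sigma**2:  ", 3.0 * sigma**2)
print("shift for flat prior in w:  ", flat - w_hat)
```

Under these assumptions the homogeneous prior shifts the estimate of log distance by 3 sigma^2, while a prior that is flat in log distance produces no shift at all, which is one simple way of seeing why the choice of prior matters for indicators with 20% scatter and is nearly irrelevant for those with 5% scatter.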

In all of the above analyses the Malmquist corrections are derived assuming that 
the conditional distribution of the 'raw' log distance estimator, $p(\hat{w}|w)$, is normal
at all true log distances. As shown in Hendry & Simmons (1994), this assumption
is invalid when the 'raw' estimator is derived from the 'Direct' TF relation. Thus, the 
formula of Landy & Szalay will result in an incorrect Malmquist correction due to the 
bias of the 'Direct' TF log distance estimator. In general, if the prior distribution of 
true log distance is inferred from the observed distribution of log distance estimates, then 
one must apply the formula of Landy & Szalay using the 'Inverse' TF estimator - which 
we have seen is normally distributed and unbiased at all true log distances, subject to 
the conditions specified above and in Hendry & Simmons (1994). A similar conclusion is 
reached in Teerikorpi (1993), Feast (1994) and Freudling et al (1995). 

It is further shown in Hendry & Simmons (1994) that the use of the 'Direct' TF rela- 
tion as the raw log distance estimator in defining general Malmquist corrections can only 
be justified if the prior distribution in equation (23) corresponds to the intrinsic distri- 
bution of true log distance. As a special case of this result, note that the homogeneous 
Malmquist correction of Lynden-Bell et al (1988) applied to the 'Direct' TF estimator 
will therefore be valid provided that the intrinsic distribution of galaxies is homogeneous. 
In a similar way, the inhomogeneous corrections derived in Hudson et al (1995), Dekel 
(1994) and Freudling et al (1995), will be valid provided the density field reconstructed 
from optical or IRAS-selected surveys corresponds to the intrinsic distribution of true log 
distance for the TF galaxies - in other words that the selection function of the redshift 
survey has been adequately corrected for, and the redshift survey faithfully traces the 
same underlying population as the galaxies to which the TF relation is being applied. 

4 CONCLUSIONS 

In this paper we have set out to describe - within the framework of mathematical statistics
- some of the properties of 'optimal' galaxy distance estimators, including unbiasedness,
sufficiency and efficiency. We have shown that the intrinsic scatter of indicators such
as the Tully-Fisher and $D_n - \sigma$ relations is sufficiently large that the question of which
statistical philosophy one should adopt in the analysis of redshift surveys is far from trivial.
In particular we have seen that one may formulate the problem of galaxy distance
estimation as a Bayesian inference problem - essentially the approach which has been
adopted implicitly in the literature in defining Malmquist-corrected distance estimators -
but that there is no general agreement over the issue of how one should then best make 
use of prior information on the distribution of true galaxy distances. In particular, a fail- 
ure to adequately understand the properties of the 'raw' galaxy distance estimator used 
can lead to the definition of invalid Malmquist corrections, as was the case in e.g. Landy 
& Szalay (1992). In Newsam, Simmons & Hendry (1995) we show that the use of such 
invalid corrections can frequently be worse than applying no corrections at all. A similar 
conclusion was reported in Freudling et al (1995), where it was shown that a number of 
biases may have gone unresolved in earlier attempts to incorporate prior information in 
the definition of distance estimators. 

In reality, the issue of defining an 'optimal' galaxy distance estimator is only the first 
part of the story. In applying the POTENT procedure, for example, whether or not a 
distance estimator is biased is not the crucial question; what is important is to construct 
an unbiased smoothed peculiar velocity field. Although there appears to be some justification
for why this procedure requires the application of an essentially Bayesian approach, the
Malmquist corrections which this approach entails are strictly only valid if galaxies are 
not too sparse, the gradient of the velocity field is not too large, and the effective radius 
of the window function used to smooth the data is not too wide. In Newsam, Simmons 
& Hendry (1995) a Monte-Carlo procedure, involving the generation of large numbers of 
'mock' redshift surveys, is devised and implemented with the purpose of eliminating all 
biases from the POTENT-recovered velocity and density fields - not only those associ- 
ated with the scatter of the distance indicators. A similar algorithm may be adopted for 
other reconstruction methods, and has the distinct advantage of being easily adapted to 
more general (and more realistic!) selection functions and distance indicators - involving, 
e.g., correlations between three or more observables where a wholly analytic treatment 
can often be intractable (c.f. Hendry & Simmons, 1994). A very similar Monte-Carlo 
approach has been adopted in Freudling et al (1995). These papers serve as an important 
reminder that the question of galaxy distance estimation cannot be regarded in isolation: 
ultimately the choice of which distance estimator is 'optimal' depends on the context in 
which the distance estimator is being used. 

It is perhaps worthwhile to end on a positive note. The use of redshift-independent
galaxy distance indicators in conjunction with redshift surveys has opened up an exciting 
- and highly productive - 'industry' in cosmology during the past decade or so. Although 
the statistical problems arising from the large intrinsic scatter of these indicators are con- 
siderable, the mathematical machinery briefly sketched in this paper equips us with the 
necessary tools to address important issues such as their sensitivity to the choice of prior 
information. Moreover, the significant recent advances made in developing and applying 
more accurate distance indicators, such as surface brightness fluctuations and the super- 
nova light curve shape method, offer some further cause for optimism: perhaps within the 
next decade we will be able to map the large scale structure of the local universe with 
sufficient accuracy that the question of whether one should adopt a Bayesian approach 
to the analysis - and how in detail it should be implemented - will no longer be important. 

Putting this another way, in the general statistics literature on Bayesian inference, 
when one's results are sensitive to the choice of prior information, one is usually advised 
to go out in search of better data. Fortunately for those cosmologists measuring galaxy
distances, it appears that such data are indeed on their way!